ALACRI2TY: Lossless Data Compression for Analytics-driven Query Processing
Authors
Abstract
ARKATKAR, ISHA. ALACRI2TY: Lossless Data Compression for Analytics-driven Query Processing. (Under the direction of Nagiza F. Samatova.) Analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require us to look for alternative ways of performing query-driven analyses. This thesis is an attempt in the direction of query processing over losslessly compressed scientific data. We propose ALACRI2TY (Analytics-driven Lossless dAta Compression for Rapid In-situ Indexing, sToring, and querYing), which at its core consists of two components: a lossless compressor and a query processing engine that operates over the compressed data. ALACRI2TY's compression component compresses double-precision scientific data by unique value-based binning. Based on significant bit splitting, ALACRI2TY improves compression ratios over general-purpose compression utilities. It then indexes the metadata about the compression rather than the data itself, which keeps index storage light-weight. The query processing engine answers range queries over this compressed data with a low degree of unnecessary decompression. ALACRI2TY's methodology involving compression and binning enables (1) indexing with a total storage requirement (data + index) of less than 135% (versus 200-300% in existing scientific database systems); (2) data access at multiple precision levels of detail, as necessitated by the varying sensitivity of analytical kernels (e.g., low precision for histograms and descriptive statistics, medium precision for clustering, and full precision for Fourier analysis); (3) robust performance across univariate as well as multi-variate query constraints via efficient bitmap-based aggregation of partial results.
Altogether, these capabilities yield a multi-fold improvement in query response time over state-of-the-art systems such as FastBit, MonetDB, and SciDB when tested on several real-world data sets from scientific simulations, using the high-end compute clusters and Lustre file system at Oak Ridge National Laboratory.
© Copyright 2012 by Isha Arkatkar
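The abstract's ideas of significant bit splitting, unique value-based binning, and range queries that avoid unnecessary decoding can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the 16-bit split point, the in-memory bin layout, and every function name here (`build_bins`, `bin_bounds`, `range_query`, `reconstruct`) are assumptions chosen for illustration, and the sketch assumes finite input values.

```python
import struct
from collections import defaultdict

# Assumed split point: sign bit + 11 exponent bits + 4 mantissa bits.
HIGH_BITS = 16
LOW_BITS = 64 - HIGH_BITS
LOW_MASK = (1 << LOW_BITS) - 1


def to_bits(x):
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b):
    """Reinterpret an unsigned 64-bit integer as a float64."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def build_bins(values):
    """Group values by their unique high-order bits.

    Each bin stores (position, low-order bits) pairs, so the original
    values can be rebuilt exactly.
    """
    bins = defaultdict(list)
    for pos, x in enumerate(values):
        b = to_bits(x)
        bins[b >> LOW_BITS].append((pos, b & LOW_MASK))
    return dict(bins)


def reconstruct(bins, n):
    """Rebuild the original sequence of n values from the bins."""
    out = [0.0] * n
    for high, entries in bins.items():
        for pos, low in entries:
            out[pos] = from_bits((high << LOW_BITS) | low)
    return out


def bin_bounds(high):
    """Smallest and largest float64 sharing this high-order prefix."""
    a = from_bits(high << LOW_BITS)
    b = from_bits((high << LOW_BITS) | LOW_MASK)
    return (a, b) if a <= b else (b, a)  # negative bins are ordered in reverse


def range_query(bins, lo, hi):
    """Return positions of values v with lo <= v <= hi."""
    hits = []
    for high, entries in bins.items():
        bmin, bmax = bin_bounds(high)
        if bmax < lo or bmin > hi:
            continue  # bin cannot match: skipped without decoding
        if lo <= bmin and bmax <= hi:
            hits.extend(pos for pos, _ in entries)  # whole bin matches
        else:
            for pos, low in entries:  # boundary bin: decode and test each value
                if lo <= from_bits((high << LOW_BITS) | low) <= hi:
                    hits.append(pos)
    return sorted(hits)


values = [1.0, 1.5, 2.75, -3.25, 100.125, 1.03125]
bins = build_bins(values)
restored = reconstruct(bins, len(values))
matches = range_query(bins, 1.0, 3.0)
```

In this sketch only the bins that straddle a query boundary have their low-order bits decoded; bins that lie entirely inside or outside the range are resolved from the high-order prefix alone, which is the sense in which the index describes the binning metadata rather than the data values themselves.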
Similar resources
Analytics-Driven Lossless Data Compression for Rapid In-situ Indexing, Storing, and Querying
The analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require the use of light-weight query-driven analysis methods, as opposed to heavy-weight schemes that optimize for speed at the expense of size. This paper is an attempt in the direction of query processing over losslessly compressed scientific data. We prop...
ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying
High-performance computing architectures face nontrivial data processing challenges, as computational and I/O components further diverge in performance trajectories. For scientific data analysis in particular, methods based on generating heavyweight access acceleration structures, e.g. indexes, are becoming less feasible for ever-increasing dataset sizes. We present ALACRITY, demonstrating the ...
Lossless Microarray Image Compression by Hardware Array Compactor
Microarray technology is a new and powerful tool for concurrent monitoring of large number of genes expressions. Each microarray experiment produces hundreds of images. Each digital image requires a large storage space. Hence, real-time processing of these images and transmission of them necessitates efficient and custom-made lossless compression schemes. In this paper, we offer a new archi...
Improving Compression Efficiency of Data Warehouse
Data compression has a paramount effect on Data warehouse for reducing data size and improving query processing. Distinct compression techniques are feasible at different levels, each of types either give good compression ratio or suitable for query processing. This paper focuses on applying lossless and lossy compression techniques on relational databases. The proposed technique is used at att...
Factorized Databases: A Knowledge Compilation Perspective
This paper overviews recent work on compilation of relational queries into lossless factorized representations. The primary motivation for this compilation is to avoid redundancy in the representation of query results and speed up their computation and subsequent analytics.
Publication date: 2011